Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification
نویسندگان
چکیده
This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for text independent as well as text dependent speaker verification. In the conventional i-vector framework, the tokens for calculating the zeroorder and first-order Baum-Welch statistics are Gaussian Mixture Model (GMM) components trained from acoustic level MFCC features. Yet besides MFCC, we believe that phonetic information makes another direction that can benefit the system performance. Our contribution in this paper lies in integrating phonetic information into the i-vector representation by several extensions, forming a more generalized i-vector framework. First, the tokens for calculating the zero-order statistics is extended from the MFCC trained GMM components to phonetic phonemes, trigrams and tandem feature trained GMM components, using phoneme posterior probabilities. Second, given the zero-order statistics (posterior probabilities on tokens), the feature used to calculate the first-order statistics is also extended from MFCC to tandem feature, and is not necessarily the same feature employed by the tokenizer. Third, the zero-order and first-order statistics vectors are then concatenated and represented by the simplified supervised i-vector approach followed by the standard Probabilistic Linear Discriminant Analysis (PLDA) back-end. We study different token and feature combinations, and we show that the feature level fusion of acoustic level Ming Li (B) SYSU-CMU Joint Institute of Eng., Sun Yat-Sen University SYSU-CMU Shunde International Joint Research Institute E-mail: [email protected] Wenbo Liu SYSU-CMU Joint Institute of Eng., Sun Yat-Sen University Department of ECE, Carnegie Mellon University E-mail: [email protected] MFCC features and phonetic level tandem features with GMM based i-vector representation achieves the best performance for text independent speaker verification. Furthermore, we demonstrate that the phonetic level phoneme constraints introduced by the tandem features help the text dependent speaker verification system to reject wrong password trials and improve the performance dramatically. Experimental results are reported on the NIST SRE 2010 common condition 5 female part task and the RSR 2015 part 1 female part task for text independent and text dependent speaker verification, respectively. For the text independent speaker verification task, the proposed generalized i-vector representation outperforms the i-vector baseline by relatively 53% in terms of equal error rate (EER) and norm minDCF values. For the text dependent speaker verification task, our proposed approach also reduced the EER significantly from 23% to 90% relatively for different types of trials.
منابع مشابه
Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features
This paper presents a generalized i-vector framework with phonetic tokenizations and tandem features for speaker verification as well as language identification. First, the tokens for calculating the zero-order statistics is extended from the MFCC trained Gaussian Mixture Models (GMM) components to phonetic phonemes, 3-grams and tandem feature trained GMM components using phoneme posterior prob...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملDNN i-Vector Speaker Verification with Short, Text-Constrained Test Utterances
We investigate how to improve the performance of DNN ivector based speaker verification for short, text-constrained test utterances, e.g. connected digit strings. A text-constrained verification, due to its smaller, limited vocabulary, can deliver better performance than a text-independent one for a short utterance. We study the problem with “phonetically aware” Deep Neural Net (DNN) in its cap...
متن کاملMulti-task learning for text-dependent speaker verification
Text-dependent speaker verification uses short utterances and verifies both speaker identity and text contents. Due to this nature, traditional state-of-the-art speaker verification approaches, such as i-vector, may not work well. Recently, there has been interest of applying deep learning to speaker verification, however in previous works, standalone deep learning systems have not achieved sta...
متن کاملI-Vector/PLDA Variants for Text-Dependent Speaker Recognition
The i-vector/PLDA approach currently dominates the field of text-independent speaker recognition and the question of how to translate this methodology to the text-dependent domain has recently become an active area of research. The essential difference between the two fields is that it is possible to do speaker recognition with enrollment and test utterances of very short duration in the text-d...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Signal Processing Systems
دوره 82 شماره
صفحات -
تاریخ انتشار 2016